You work for a large corporation that owns a collection of restaurants of different types. Currently it is evaluating the location, type, characteristics (cuisine, price point, design, marketing strategy) and positioning of a new restaurant in Edinburgh (if you know the local habits, you can give better insight). Your task is to analyse the dataset and give recommendations on strategy, based on the reviews, location (either neighbourhood or zipcode level) and competition. Also give estimates on volumes and revenues of the potential undertaking. You may have to check out what the price-range attributes signify by visiting the Yelp website.
This analysis is conducted for Edinburgh only, because extensive demographic data was found that could easily be integrated with the Yelp dataset. The team also had better local knowledge of the zip codes, Edinburgh being in the UK. This report is split into three main analyses, then concluded with an estimation of revenues and a summary.
Topic Modelling Analysis: The reviews of Edinburgh’s restaurants are grouped into topics using an NMF topic modelling method to look for potential topics or opportunities to exploit.
Demand and Location Analysis: Summary statistics, such as the number of reviews per restaurant and by cuisine, are mapped out to give an indication of relative demand by location, which allows for a recommendation of location.
Competition and Style Analysis: A metric of competition is computed for different styles of restaurant to recommend the appropriate attributes of the restaurant.
This analysis looks through reviews for topics or opportunities to take into account when opening a restaurant in Edinburgh. For a relatively small collection of reviews it may be possible to manually inspect and classify the contents into specific categories based on similarity, but this becomes time-consuming for a dataset of 14,000+ reviews.
This section uses topic modelling to identify the topics users care about most when giving their star ratings, ultimately determining what a new restaurant should do in order to receive high ratings.
A non-negative matrix factorization (NMF) approach is used to detect topics from reviews. NMF simultaneously performs dimensionality reduction and clustering to identify semantic features in a document collection and groups the documents into clusters on the basis of shared semantic features [1]. The extracted topics from 1-star and 2-star reviews were used as an indicator of bad practice whereas extracted topics from 4-star and 5-star reviews were used as an indicator of good practice for operating a restaurant.
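As a minimal sketch of the factorization NMF performs (on a toy count matrix invented here for illustration, not the Yelp data), scikit-learn decomposes a document-term matrix V into non-negative factors W (document-topic weights) and H (topic-term weights):

```python
# Toy illustration of NMF: V is approximately W @ H, both non-negative.
# The counts below are made up; the real analysis uses TF-IDF weights
# from the Yelp reviews.
import numpy as np
from sklearn.decomposition import NMF

# 4 documents x 6 terms: two "food"-heavy and two "service"-heavy documents
V = np.array([
    [3, 2, 0, 0, 1, 0],
    [2, 3, 1, 0, 0, 0],
    [0, 0, 0, 3, 2, 2],
    [0, 1, 0, 2, 3, 2],
], dtype=float)

model = NMF(n_components=2, init="nndsvd", max_iter=500)
W = model.fit_transform(V)   # document-topic weights
H = model.components_        # topic-term weights

print(W.shape, H.shape)      # (4, 2) (2, 6)
```

Each review then loads most heavily on one or two topics (rows of W), while each topic is characterised by its highest-weighted terms (rows of H).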
Import the datasets.
business = pd.read_csv("edinburgh.csv",header=0)
checkin = pd.read_csv("edinCheckin.csv",header=0)
review = pd.read_csv("edinReview.csv",header=0)
tip = pd.read_csv("edinTip.csv",header=0)
user = pd.read_csv("edinUser.csv",header=0,usecols=range(0,23))
The main dataset in this study was ‘yelp_academic_dataset_review.json’, and a summary of the dataset is shown below.
# Replace nan with blank space
review=review.replace(np.nan,' ', regex=True)
# Summary of dataset
review.info()
The review text is full of punctuation, numbers and capital letters, so further data cleansing is required for text analysis.
review.text.head(10)
Data Cleansing
Text after cleaning up the text in review dataset:
# Referenced Regular Expression for email cleanup idea:
def cleanup(text):
    # Make text lower case
    text = text.lower()
    # Remove escape symbols
    text = text.replace('\r', ' ')
    text = text.replace('\n', ' ')
    # Remove all non-ascii characters in the string
    text = text.encode('ascii', 'ignore').decode('ascii')
    # Create a list of regular expressions to apply
    cleanuptools = [
        # Days of the week
        r"(monday|tuesday|wednesday|thursday|friday|saturday|sunday)",
        # Months
        r"january|february|march|april|may|june|july|august|september|october|november|december",
        # Punctuation and numbers to be removed
        r'[-.?!,":;()0-9]',
    ]
    for tool in cleanuptools:
        text = re.sub(tool, " ", text)
    return text
# Use scikit-learn's built-in English stop word list
stopwords = sk.feature_extraction.text.ENGLISH_STOP_WORDS
# Apply the created functions to clean up text
review.text=review.text.apply(cleanup)
# Cleaned text
review.text[28]
Distribution of number of words
f, ax = plt.subplots(figsize=(15,7.5))
n, bins, patches = ax.hist(review.text.apply(len),facecolor='black',bins=20)
ax.set_xticks(bins)
bin_centers = 0.5 * np.diff(bins) + bins[:-1]
for count, x in zip(n, bin_centers):
percent = '{:.2f}%'.format((float(count) / n.sum())*100)
ax.annotate(percent, xy=(x, 0), xycoords=('data', 'axes fraction'),
xytext=(0, -32), textcoords='offset points', va='top', ha='center')
ax.set_xlabel('Length of review (characters)')
ax.set_ylabel('Count of reviews')
Only reviews with a considerable amount of content should be included, as topic modelling cannot extract insights from very short reviews. The length of reviews was therefore analysed and a minimum length requirement was implemented. In general, though, reviews are long even without any cut-off: more than 85% are over 250 characters.
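A minimal sketch of such a length filter, assuming a 250-character cutoff for illustration (the toy reviews below are invented, not drawn from the Yelp data):

```python
import pandas as pd

# Toy reviews standing in for the Yelp review dataset
review = pd.DataFrame({
    "text":  ["great food and lovely staff, " * 12, "ok", "terrible service, " * 20],
    "stars": [5, 3, 1],
})

min_len = 250  # assumed cutoff, for illustration only
review_long = review[review.text.apply(len) >= min_len]
print(len(review), "->", len(review_long))  # 3 -> 2
```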
The distribution of ratings was also examined to check that the dataset was not too imbalanced.
Reviews VS Stars
review.stars.value_counts(sort=False).plot(kind='bar',color="black")
plt.title('Reviews By Star');
Intuitively, we expected more 1-star and 5-star reviews; however, the data shows otherwise, and the majority of reviews are 4-star and 5-star. It is also important to check the usefulness/reliability of reviews by examining their numbers of votes.
Reviews VS Votes
The majority of reviews have only one vote regardless of vote type (cool, useful or funny). So we decided not to remove reviews based on their numbers of votes (doing so would remove a significant amount of text from this analysis).
Votes for cool
pd.crosstab(review.stars,review.votes_cool[review.votes_cool!=0],margins=True)
Votes for funny
pd.crosstab(review.stars,review.votes_funny[review.votes_funny!=0],margins=True)
Votes for useful
pd.crosstab(review.stars,review.votes_useful[review.votes_useful!=0],margins=True)
Dividing the dataset
Reviews were split into two groups: 1- and 2-star reviews were interpreted as bad reviews, and 4- and 5-star reviews as good reviews. 3-star reviews were excluded from this analysis due to the mixed sentiment they express (three stars could be a good or a bad review).
# Split the dataset into 2 categories: 1,2 stars and 4,5 stars
review_bad = review[(review.stars == 1) | (review.stars == 2)]
review_good = review[(review.stars == 4) | (review.stars == 5)]
review_good.text.head(10)
Example of good reviews:
Text feature extraction
A range of functions from the Python scikit-learn package was used to extract features (words) from the textual review content and create a document-term matrix, namely [1]-[3]:
In a large text corpus, stop words are common (e.g. “the”, “a”, “is”, “and” in English) and carry little meaningful information about the actual contents of a document. In order to reduce the influence of words appearing frequently across the entire corpus, TF-IDF term re-weighting was used to normalise the data. The TfidfVectorizer function combines TF-IDF weighting with vectorization (a process combining tokenization, counting and normalization), allowing a document-term matrix to be built for the corpus of documents:
#Vectorization
tfidfvectorizer_bad = TfidfVectorizer(max_features=15000, ngram_range=(1, 2), stop_words = stopwords,
strip_accents="unicode", use_idf=True, norm="l2", min_df = 5)
tfidfvectorizer_good = TfidfVectorizer(max_features=15000, ngram_range=(1, 2), stop_words = stopwords,
strip_accents="unicode", use_idf=True, norm="l2", min_df = 5)
#Create term document matrix for separate datasets
term_document_matrix_bad = tfidfvectorizer_bad.fit_transform(review_bad.text)
term_document_matrix_good = tfidfvectorizer_good.fit_transform(review_good.text)
Topic Modelling
NMF (Non-negative Matrix Factorization) is an unsupervised learning method for grouping a collection of documents and discovering abstract topics. The scikit-learn package has an implementation of NMF with NNDSVD initialization.
Nonnegative Double Singular Value Decomposition (NNDSVD) is typically used for overcoming sparseness of data in document-term matrix [8]. Here we set the number of topics to be equal to 25 and run NMF for 200 iterations, then get the factors W and H from the resulting model:
# 1-star + 2-star reviews
nmfmodel_bad = sk.decomposition.NMF(init="nndsvd", n_components=25, max_iter=200)
W_bad = nmfmodel_bad.fit_transform(term_document_matrix_bad)
H_bad = nmfmodel_bad.components_
# W (number of reviews, number of topics) and H (number of topics, number of features)
print("Generated factor W of size %s and factor H of size %s for bad reviews"
      % (str(W_bad.shape), str(H_bad.shape)))
# 4-star + 5-star reviews
nmfmodel_good = sk.decomposition.NMF(init="nndsvd", n_components=25, max_iter=200)
W_good = nmfmodel_good.fit_transform(term_document_matrix_good)
H_good = nmfmodel_good.components_
# W (number of reviews, number of topics) and H (number of topics, number of features)
print("Generated factor W of size %s and factor H of size %s for good reviews"
      % (str(W_good.shape), str(H_good.shape)))
#Write functions to plot graphs which present the top topics discovered by the NMF model and create a list of features generated from the NMF model.
# Create a colour series for graph plotting
def grey_color_func(word, font_size, position, orientation, random_state=None, **kwargs):
return "hsl(0, 0%%, %d%%)" % random.randint(60, 100)
# Writting functions to show top words in each topic
def TopTermsByTopic(nmfmodel, features, top):
for index, topic in enumerate(nmfmodel.components_):
print "\n Topic {}: \n".format(index+1)
print "Percentage of Words: {:.2%}\n".format(np.count_nonzero(topic) / 41961.)
top_words = [features[i] for i in nmfmodel.components_[index].argsort()[::-1][:top]]
topic_words = ' '.join(top_words)
#Prepare data for horizontal bar charts
top15_index = nmfmodel.components_[index].argsort()[::-1][:15]
top15_topic = sorted(topic[top15_index],reverse=False)
#Prepare data for wordclouds
wc = WordCloud(max_font_size=80,relative_scaling=.5,width=800,height=500).generate(topic_words)
#Create a space for graphs
fig, ax = plt.subplots(2,figsize=(12,10))
rect1 = ax[0].barh(.5 + np.arange(15) + .5, top15_topic, color="black", align="center")
rect2 = ax[1].imshow(wc.recolor(color_func=grey_color_func, random_state=3))
#Subplot 1 - Horizontal Bar Chart
ax[0].set_title("Top 15 Terms in Topic {}".format(index + 1))
ax[0].set_xlabel("Weight")
ax[0].set_yticks(.5 + np.arange(15)+ .5)
ax[0].set_yticklabels([features[i] for i in topic.argsort()[::-1][:15]])
ax[0].grid(True)
#Subplot 2 - WordCloud
ax[1].axis("off")
ax[1].set_title("Wordcloud of Topic {}".format(index + 1))
#Show the graphs
plt.tight_layout()
plt.show()
#Extracting the feature names
features_bad= tfidfvectorizer_bad.get_feature_names()
features_good= tfidfvectorizer_good.get_feature_names()
Horizontal bar chart of the top 15 highest-weighted terms for the most significant topic, in bad and good reviews, discovered by the NMF model. The rest of the terms are shown in a word cloud.
TopTermsByTopic(nmfmodel_bad, features_bad, 100)
Example of Topic 1 for bad reviews:
TopTermsByTopic(nmfmodel_good, features_good, 100)
Example of Topic 1 for good reviews:
A summary of the 25 topics from the bad- and good-review NMF models. An attempt was made to interpret the contents of each topic: the keywords in each topic were manually examined and a description was then assigned. Experiments suggest 25 topics is optimal, allowing for a clear separation of topics.
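One way such an experiment over the number of topics could be run (a sketch on a toy corpus, not necessarily the procedure the report used) is to refit NMF for several component counts and compare the Frobenius reconstruction error, looking for an elbow:

```python
# Sketch of model selection over the number of NMF topics: refit for
# several k and compare reconstruction error (lower = better fit).
# The toy corpus below stands in for the Yelp review texts.
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "soggy chips bad batter", "rude staff long wait",
    "great fish lovely chips", "friendly staff fast service",
    "bad cold chicken curry", "excellent chicken tasty curry",
]
X = TfidfVectorizer().fit_transform(docs)

errors = {}
for k in (2, 3, 4):
    model = NMF(n_components=k, init="nndsvd", max_iter=400).fit(X)
    errors[k] = model.reconstruction_err_

print(errors)  # error shrinks as k grows; the elbow guides the choice
```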
data_bad = {'Index of Topics': ['Topic {}'.format(i) for i in range(1, 26)],
            'Type of Topics': ['Bad Service from Manager and Waiter/Waitress', 'Bad Coffee Shop',
                               'Bad Chicken Dishes (Fried, Boiled, Curry)',
                               'Bad Italian Foods (Pizza and Pasta: Toppings, Base, Sauce, etc.)',
                               'Bad Fried Chips (Soggy Batter)', 'Bad Burgers', 'Unknown Topic',
                               'Bad Wait and Time Management', 'Bad Experience and Services',
                               'Bad Restaurant (Birthplace of Harry Potter)', 'Bad Afternoon Tea',
                               'Unfriendly and Rude Staff', 'Bad Chinese Sweet and Sour Foods',
                               'Unknown Topic', 'Bad Mexican Foods', 'Bad Place for Drinks (Too Quiet)',
                               'Bad Japanese Foods (Tuna, Miso Soup)', 'Bad Prices',
                               'Bad Breakfast (Eggs Benedict)',
                               'Bad Wait and Time Management', 'Bad Noodles', 'Good Comments in Bad Reviews',
                               'Expensive Place for Tourists', 'Hot Room Temperature', 'Bad Thai Foods']}
topic_table_bad = pd.DataFrame(data_bad)
display(topic_table_bad)
data_good = {'Index of Topics': ['Topic {}'.format(i) for i in range(1, 26)],
             'Type of Topics': ['Unknown Topic', 'Good Quality Foods', 'Good Place and Atmosphere',
                                'Good Bars and Pubs', 'Good Scottish Breakfast',
                                'Good Fish and Chips with Nice Peas', 'Good Thai Foods with Decent Prawn',
                                'Good Indian Foods', 'Good Menu', 'Good Beef Burger with Decent Sweet Potatoes',
                                'Good Price and Value', 'Good Sandwiches', 'Good Italian Foods (Pasta and Pizza)',
                                'Good Chocolates and Ice Creams', 'Good Coffee Shops with Nice Artisan Coffee and Espresso',
                                'Good Japanese Foods (Bento, Nigiri, Kanpai)', 'Good Mexican Burritos and Tacos',
                                'Excellent Service', 'Good Afternoon Tea', 'Good BBQ Shops (Crackling Pork, Haggis)',
                                'Unknown (German-Language Reviews)', 'Friendly Staff', 'Good Foods',
                                'Good Vegetarian Restaurants', 'Good Potato Shops']}
topic_table_good = pd.DataFrame(data_good)
display(topic_table_good)
Opportunities to explore when opening a restaurant in Edinburgh:
Areas to avoid and improve when opening a restaurant in Edinburgh:
Note that sectors where existing businesses are doing very well could also present a big challenge for new entrants.
In conclusion, the insights extracted by topic modelling are very useful for planning the opening of a new restaurant in Edinburgh. For example, it was clear that customers are not satisfied with the current fish and chips and chicken dishes in Edinburgh, so a new restaurant could target these specific sectors to fulfil customer demand. The restaurant should also be careful with certain issues raised in bad reviews: for example, unfriendly staff, long waiting times and uncomfortable room temperatures frequently impacted the customer experience. Finally, sectors that are doing very well in Edinburgh, such as bars, pubs, BBQ and potato shops, are areas that a new restaurant may want to avoid in order to escape fierce competition.
This section estimates the demand for different restaurants per Edinburgh zip code, and looks at the competition per zip. The following steps were performed:
The pandas Python package was used to filter the Yelp data to restaurants, check-ins, reviews, tips and users in Edinburgh. This provided data on 1215 restaurants in Edinburgh registered on Yelp.
Regarding cuisine type, where many different category attributes describe a restaurant, the most expressive cuisine keyword was assigned. For example, a restaurant categorised as “Gastropubs,Bars,Scottish,Nightlife,Restaurants” would be assigned the cuisine keyword “Scottish”.
The first three or four characters of each restaurant’s zip code (for example EH6 or EH12) were extracted from its address, which enables the joining of external demographic data.
The data obtained from the Edinburgh government website [4] provided locality and ward demographic profiles. It contained data on gender, age, housing, employment, education and professions, income, benefits, health and disability, lifestyle, satisfaction with services, and Scottish Index of Multiple Deprivation data, of which population, income and property prices were of primary interest. However, this data was organized by ward, with each ward containing many zip codes. Figure 1 maps the polling districts and figure 2 maps the ward districts.
“Figure 1: Polling Districts”
“Figure 2: Wards”
To attribute ward demographic data to each zip, an external dataset of polling districts from the City of Edinburgh mapping portal [5] was used. Each polling location has a zip code and a ward. When a zip code makes up the majority of occurrences in a ward, that ward’s demographic data is attributed entirely to that zip. For example, if the ward “Almond” has 4 zip codes in EH4 and 1 in EH2, EH4 is considered to be ward Almond and takes the Almond demographic data.
As some zip codes, such as EH1 and EH2, are very small, they were not assigned to any single ward by the method above. For these zip codes, population numbers were attributed in proportion to the number of polling-station zip codes in the ward. For example, the City Centre ward is in general made up of EH1 and EH2, so its population is split evenly between EH1 and EH2. Income and property prices remained the same.
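The majority-vote assignment described above can be sketched as follows; the ward/zip pairs are invented for illustration (“Almond” mirrors the example in the text):

```python
# Sketch of the majority-vote ward-to-zip assignment: for each ward,
# the zip that occurs most often among its polling places takes that
# ward's demographics. Data is invented for illustration.
import pandas as pd

polling = pd.DataFrame({
    "NEWWARD": ["Almond"] * 5 + ["Leith Walk"] * 3,
    "zip":     ["EH4", "EH4", "EH4", "EH4", "EH2", "EH6", "EH6", "EH7"],
})

# Most frequent zip per ward
ward_to_zip = (polling.groupby("NEWWARD")["zip"]
                      .agg(lambda s: s.value_counts().idxmax()))
print(ward_to_zip.to_dict())  # {'Almond': 'EH4', 'Leith Walk': 'EH6'}
```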
Data on all the hotels in Edinburgh was also downloaded from booking.com and integrated. This data was cleansed to obtain hotel counts by zip code.
import json
import pandas as pd
from glob import glob
def convert(x):
    ''' Convert a json string to a flat python dictionary
    which can be passed into Pandas. '''
    ob = json.loads(x)
    for k, v in list(ob.items()):
        if isinstance(v, list):
            ob[k] = ','.join(v)
        elif isinstance(v, dict):
            for kk, vv in v.items():
                ob['%s_%s' % (k, kk)] = vv
            del ob[k]
    return ob

def convert2(x):
    ''' As convert(), but handles lists whose elements are not strings. '''
    ob = json.loads(x)
    for k, v in list(ob.items()):
        if isinstance(v, list):
            ob[k] = ','.join(str(i) for i in v)
        elif isinstance(v, dict):
            for kk, vv in v.items():
                ob['%s_%s' % (k, kk)] = vv
            del ob[k]
    return ob
business = pd.DataFrame([convert(line) for line in open("yelp_academic_dataset_business.json")])
checkin = pd.DataFrame([convert(line) for line in open("yelp_academic_dataset_checkin.json")])
review = pd.DataFrame([convert(line) for line in open("yelp_academic_dataset_review.json")])
tip = pd.DataFrame([convert(line) for line in open("yelp_academic_dataset_tip.json")])
user = pd.DataFrame([convert2(line) for line in open("yelp_academic_dataset_user.json")])
cusine = ['Scottish', 'Mediterranean','Chinese','French','Italian', 'Thai', 'Indian', 'British', 'European', 'Greek', 'Nepalese', 'German', 'Turkish', 'Mexican', 'Pakistan', 'Seafood', 'Japanese', 'Vegetarian', 'Brazilian', 'American','Spanish','Bakeries', 'Gastropubs', 'Soup','Caterers','Cafes','Smoothies','Bistros','Pubs', 'Coffee & Tea','Burgers', 'Delis','Fast Food', 'Sandwiches', 'Pizza', 'Fish & Chips', 'Polish', 'African', 'Korean', 'Middle Eastern', 'Creperies', 'Brasseries','Chicken Wings' ]
def cusineCategory(x):
    # Assign the first matching cuisine keyword from the list above
    for cus in cusine:
        if cus in x.categories:
            return cus

def getZip(x, name):
    # Extract the zip prefix (e.g. EH6 or EH12) from an address field
    start = x[name].find("EH")
    return x[name][start:start + 4].replace(" ", "")
edinburgh = business[business.city == "Edinburgh"]
edinburgh = edinburgh[edinburgh['categories'].str.contains("Restaurants")]
edinburgh['cusine'] = edinburgh.apply(cusineCategory, axis = 1)
edinburgh['zip'] = edinburgh.apply(getZip,args=('full_address',),axis=1)
edinburgh = edinburgh[edinburgh.zip != "EH17"]
edinburghid = pd.DataFrame(edinburgh.business_id)
edinReview = pd.merge(edinburghid,review, how='left' ,on="business_id")
edinCheckin = pd.merge(edinburghid,checkin, how='left' ,on="business_id")
edinTip = pd.merge(edinburghid,tip, how='left' ,on="business_id")
edinUserIds = pd.concat([edinReview.user_id,edinTip.user_id])
uniqueIds = pd.DataFrame(edinUserIds.unique())
uniqueIds.columns = ['user_id']
edinUsers = pd.merge(uniqueIds,user, how = 'left' , on = 'user_id')
edinburgh.to_csv('edinburgh.csv', encoding = 'utf-8')
edinReview.to_csv('edinReview.csv', encoding = 'utf-8')
edinCheckin.to_csv('edinCheckin.csv', encoding = 'utf-8')
edinTip.to_csv('edinTip.csv',encoding = 'utf-8')
edinUsers.to_csv('edinUser.csv',encoding='utf-8')
polling = pd.read_csv('polling.csv')
polling['zip'] = polling.apply(getZip,args=('POSTCODE',),axis=1)
polling = polling[['NEWWARD','zip']]
pollGroup = polling.groupby(['NEWWARD','zip']).size().sort_values(ascending=False)
pollGroup
wardsList = ['Almond' , 'Drum Brae Gyle' ,'Pentland Hills', 'Forth' , 'Forth.1', 'Inverleith' , 'Corstorphine Murrayfield' , 'Sighthill Gorgie' , 'Colinton Fairmilehead' , 'Fountainbridge Craiglockhart' , 'Meadows Morningside' , 'City Centre' ,'City Centre.1', 'Leith Walk' , 'Leith ' , 'Craigentinny Duddingston' , 'Southside Newington' ,'Southside Newington.1','Liberton Gilmerton' , 'Portobello Craigmillar']
wards = pd.read_csv('rawwards.csv')
# Wards spanning two zip codes are duplicated under a '.1' column,
# relabelled with the second zip, and their numeric indicators are
# split in half between the two zips
for ward, secondZip in [('Forth', 'EH5'), ('Southside Newington', 'EH9'), ('City Centre', 'EH2')]:
    wards[ward + '.1'] = wards[ward]
    wards.loc[0, ward + '.1'] = secondZip
    halved = pd.to_numeric(wards[ward][1:17], errors='coerce') / 2
    wards.loc[1:16, ward + '.1'] = halved
    wards.loc[1:16, ward] = halved
wardsMelt = pd.melt(wards, id_vars=['Indicator'], value_vars=wardsList)
# Forward-fill the zip code attached to each block of ward indicators
zipCode = ""
wardsMelt['Zip'] = ""
for i, row in wardsMelt.iterrows():
    if row['Indicator'] == 'Zip':
        zipCode = row['value']
    wardsMelt.loc[i, 'Zip'] = zipCode
wardsMelt = wardsMelt[wardsMelt.Indicator != 'Zip']
wardsMelt.columns = ['indicator','type','value','Zip']
wardsMelt[wardsMelt['type'] == 'City Centre']
wardsMelt.to_csv('edinPopulation.csv')
Figure 3 below shows the number of restaurants in Edinburgh. Figure 4 shows which types of cuisine have the most restaurants. In general, Italian and British cuisine are the most common. The city centre (zip codes EH1 and EH2) contained the greatest number of restaurants.
“Figure 3: Density Map of Restaurants / Figure 4 Number of Restaurants by Cuisine”
The relationship between cuisine type and the number of reviews (demand) was also inspected (figure 5).
“Figure 5: Cuisines with Reviews”
Figure 5 shows that British food has a very high number of reviews in Edinburgh, despite there being more Italian restaurants than British ones (figure 4). This suggests that British restaurants are quite popular and undersupplied.
In certain places, such as EH2 (figure 6), there are substantially more Italian restaurants than British ones. It might be worth capitalising on this gap in the market in EH2.
“Figure 6 EH2 Number of Stores and Amount of Cuisines in Edinburgh”
Interestingly, the restaurants in the city centre (EH1 and EH2) had the lowest average rating (Figure 7, light patch in the centre), which is contrary to the intuition that the centre should have the best restaurants. Many low-scoring restaurants in the city centre pull down the overall average. If one wanted to keep the restaurant in a high-ratings area (which attracts visitors), then it should not be placed in the city centre.
“Figure 7: Average rating per zip code and number of resturants by cuisine”
The demand a restaurant or zip receives was proxied using the number of reviews. Note that the number of users could also have been used as a proxy, but this was decided against because a user can visit a restaurant twice and leave two reviews. Figure 8 shows the locations with the most reviews: EH1, EH2, EH3 and EH8. The analysis focuses on these areas.
“Figure 8: Number Reviews per Restaurant and zip”
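The review-count demand proxy amounts to joining each review to its restaurant’s zip and counting; a minimal sketch with invented data (the report does this with edinReview and the edinburgh business table):

```python
# Sketch of the demand proxy: reviews per zip, via a merge on business_id.
# Business/review rows are invented for illustration.
import pandas as pd

edinburgh = pd.DataFrame({
    "business_id": ["b1", "b2", "b3"],
    "zip":         ["EH1", "EH1", "EH3"],
})
edinReview = pd.DataFrame({
    "business_id": ["b1", "b1", "b2", "b3"],
    "stars":       [5, 4, 3, 5],
})

demand = (edinReview.merge(edinburgh, on="business_id")
                    .groupby("zip").size()
                    .sort_values(ascending=False))
print(demand.to_dict())  # {'EH1': 3, 'EH3': 1}
```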
stats = ['All', 'Average annual household income', 'Average property value']
population = wardsMelt[wardsMelt.indicator == stats[0]].copy()
population.value = pd.to_numeric(population.value)
populationGroup = population.groupby('Zip').value.sum()
income = wardsMelt[wardsMelt.indicator == stats[1]].copy()
income.value = pd.to_numeric(income.value)
incomeGroup = income.groupby('Zip').value.sum()
propertyValue = wardsMelt[wardsMelt.indicator == stats[2]].copy()
propertyValue.value = pd.to_numeric(propertyValue.value)
propertyGroup = propertyValue.groupby('Zip').value.sum()
restaurantGroup = edinburgh.groupby('zip').cusine.count()
hotels = pd.read_csv('edinHotels.csv')
hotels['zip'] = hotels.apply(getZip, args=('zip',), axis=1)
hotelsGroup = hotels.groupby('zip').address.count()
statistics = pd.concat([populationGroup/restaurantGroup, incomeGroup/restaurantGroup,
                        propertyGroup/restaurantGroup, hotelsGroup/restaurantGroup,
                        restaurantGroup], axis=1)
statistics.columns = ['pop/rest', 'income/rest', 'property/rest', 'hotels/rest', 'Num of Rest']
statistics = statistics.dropna()
Metrics for each zip code are shown in table 1. Included are the ratios of: total population to number of restaurants, average household income to number of restaurants, average property value to number of restaurants, and number of hotels to number of restaurants, per zip.
“Table 1: Metrics on zip code”
To restate, the analysis focuses on the candidate areas with high demand: EH1, EH2, EH3 and EH8. EH8 has the largest ratio of population to restaurants of the candidate areas; however, it also has the lowest number of hotels per restaurant, suggesting fewer tourists in the area. In contrast, EH3 has high population-to-restaurant and hotel-to-restaurant ratios. Furthermore, EH3’s property-value-to-restaurant ratio is low, potentially making it cheaper to buy a restaurant there.
“Figure 9: Density Map of Restaurant and Number of Restaurant by Cuising for zip EH3”
Figure 9 shows that British cuisine is not too saturated in EH3: there are more Chinese restaurants than British ones, and also many Indian, Italian and Thai restaurants. EH3 is therefore a potentially good area in which to open a British restaurant.
The topic modelling analysis from the first section found two relevant points: 1) customers are not satisfied with the batter used for fish and chips in Edinburgh, which reviewers complained was too soggy, and 2) customers are also not particularly happy with chicken dishes in Edinburgh. Therefore, in EH3 the restaurant should focus on being really good at cooking chicken as well as fish. It is further recommended that “chicken and fish” are the only foods offered, so that the restaurant can specialise in making them well.
This section clusters restaurants by their attributes into 5 groups using hierarchical clustering, and analyses the competition in each zip code to find opportunities by restaurant group.
Groups are defined as classy, general, fast food, stylish pubs & bars, and other. A metric of competition is defined for these groups in each zip code to recommend the attributes of the new restaurant.
There are a total of 1215 restaurants in the dataset, 55 of which have no attributes at all. Attributes are factors such as casual, classy, open bar, dessert, good for groups, good for children, etc. These 1215 restaurants are first clustered into 5 different categories.
After considering different distance metrics and linkages, Euclidean distance and Ward linkage were chosen for their simplicity and the better-structured dendrogram they produced.
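The helper hclst used below is not defined in the report; a plausible sketch, assuming it wraps scipy’s linkage and returns the linkage matrix together with the dendrogram data, could look like this (with toy binary attribute data):

```python
# Hypothetical sketch of the hclst helper: hierarchical clustering
# returning (linkage matrix, dendrogram dict). The attribute matrix
# below is invented; the report uses the restaurant attribute table.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

def hclst(data, method, metric):
    """Hierarchical clustering: return the linkage matrix and dendrogram data."""
    Z = linkage(data, method=method, metric=metric)
    den = dendrogram(Z, no_plot=True)
    return Z, den

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(12, 6)).astype(float)  # 12 restaurants, 6 binary attributes
R = hclst(X, "ward", "euclidean")
print(R[0].shape)  # (11, 4): n-1 merges, 4 columns per merge
```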
# Filter out restaurants with no attributes at all, then cluster
df3_filtered = df3.loc[(df3 != 0).any(axis=1)]
df3_filtered = df3_filtered.iloc[:, 0:58]
R1 = hclst(df3_filtered.iloc[:, 1:58], "ward", "euclidean")
df3_filtered
“Fig 10: Dendrogram on style clustering”
Now to determine cuts by different measures.
#To determine cuts by different measures
from scipy.cluster.hierarchy import inconsistent
depth = 20
incons1 = inconsistent(R1[0], depth)
incons1[-20:]
plt.figure(figsize=(25, 10))
plt.title('Inconsistency Measure of last 20 Merges with depth ' + str(depth))
plt.xlabel('Merges')
plt.ylabel('Inconsistency')
plt.plot(incons1[-20:,3])
plt.show()
#Determine cuts By Elbow
plt.figure(figsize=(25, 10))
plt.title('Height of the last 20 Merges')
plt.xlabel('Merges from the last merge')
plt.ylabel('Height')
last = R1[0][-20:, 2]
last_rev = last[::-1]
idxs = np.arange(1, len(last) + 1)
plt.plot(idxs, last_rev)
plt.show()
plt.figure(figsize=(25, 10))
plt.title('Rate of increase in Height between Merges')
plt.xlabel('Merges from the last merge')
plt.ylabel('Rate')
acceleration = np.diff(last, 2) # 2nd derivative of the distances
acceleration_rev = acceleration[::-1]
plt.plot(idxs[:-2] + 1, acceleration_rev)
plt.show()
from scipy.cluster.hierarchy import fcluster
max_d = 15
clusters = fcluster(R1[0], max_d, criterion='distance')
df3_filtered['cluster']=clusters
“Fig 11: Inconsistency Measure of last 20 merges”
“Fig 12: Height of the last 20 merges”
“Fig 13: Rate of increase in Height of the last 20 merges”
From the plots, a cut giving 5 clusters is the most desirable. Each cluster’s attribute counts, as a percentage of the cluster population, are shown below as bar plots.
#To visualise results
C1=df3_filtered[df3_filtered.cluster==1]
C1=C1.drop(C1.columns[57], axis=1)
C1=C1.describe()
C2=df3_filtered[df3_filtered.cluster==2]
C2=C2.drop(C2.columns[57], axis=1)
C2=C2.describe()
C3=df3_filtered[df3_filtered.cluster==3]
C3=C3.drop(C3.columns[57], axis=1)
C3=C3.describe()
C4=df3_filtered[df3_filtered.cluster==4]
C4=C4.drop(C4.columns[57], axis=1)
C4=C4.describe()
C5=df3_filtered[df3_filtered.cluster==5]
C5=C5.drop(C5.columns[57], axis=1)
C5=C5.describe()
plt.figure(figsize=(25, 10))
plt.title('Cluster 1:Unknown Restaurants')
plt.xlabel('Attributes')
plt.ylabel('Proportion of population')
C1.iloc[1, 0:56].plot(kind='bar', ylim=[0, 1])
df3_filtered[df3_filtered.cluster==1]
Cluster 1
“Fig 14: Cluster 1: Unknowns”
Cluster 1 contains 382 restaurants. Most of these restaurants do not provide much information for classification.
plt.show()
plt.figure(figsize=(25, 10))
plt.title('Cluster 2:Classy Restaurants')
plt.xlabel('Attributes')
plt.ylabel('Proportion of population')
C2.iloc[1, 0:56].plot(kind='bar', ylim=[0, 1])
df3_filtered[df3_filtered.cluster==2]
Cluster 2
“Fig 15: Cluster 2: The Classies”
Cluster 2 contains 119 restaurants. They are the most expensive and classiest type of restaurant (three $ signs on Yelp), with classy, romantic, intimate, trendy and upscale atmospheres. The bar plot shows they are characterised by a full bar, dinner service, waiters, and taking reservations. Most of them require customers to be ‘dressy’, and are good for groups and dating. They are more quiet, and some provide background music.
plt.show()
plt.figure(figsize=(25, 10))
plt.title('Cluster 3:Stylish restaurants and Pubs & Bars')
plt.xlabel('Attributes')
plt.ylabel('Proportion of population')
C3.iloc[1,0:56].plot(kind='bar',ylim=[0,1])
df3_filtered[df3_filtered.cluster==3]
Cluster 3
“Fig 16: Cluster 3: The Stylish and Pubs & Bars”
Cluster 3 contains 310 restaurants. These are stylish restaurants and pubs & bars priced at level 2 (two $ signs on Yelp, i.e. $11-30), providing mainly dinner and lunch.
These are less quiet; a few can even be loud or very loud. This cluster contains most of the restaurants that provide DJ, jukebox, live or video music, and TV broadcasts. Some of these restaurants are quite stylish, with hipster, intimate or trendy styles. They are mainly good for groups.
plt.show()
plt.figure(figsize=(25, 10))
plt.title('Cluster 4: Fastfood & Takeaways')
plt.xlabel('Attributes')
plt.ylabel('Proportion of population')
C4.iloc[1,0:56].plot(kind='bar',ylim=[0,1])
df3_filtered[df3_filtered.cluster==4]
Cluster 4
“Fig 17: Cluster 4: The Fastfood and Takeaways”
There are 171 restaurants in cluster 4. This group comprises the cheapest kind of restaurants (one $ sign), probably fastfood restaurants, as most of them do not provide waiter service or wine, nor have bars. They are characterised by being casual, good for groups and kids, having an average noise level, and offering takeout.
plt.show()
plt.figure(figsize=(25, 10))
plt.title('Cluster 5: General Restaurants')
plt.xlabel('Attributes')
plt.ylabel('Proportion of population')
C5.iloc[1,0:56].plot(kind='bar',ylim=[0,1])
df3_filtered[df3_filtered.cluster==5]
Cluster 5
“Fig 18: Cluster 5: The Generals”
Cluster 5 contains 233 restaurants. They are general restaurants, providing brunch, dinner and lunch. They are less quiet but not the noisiest type. These restaurants are not very stylish and almost none of them provide any music. They are generally group and kids friendly.
Distribution of restaurant clusters across zip codes
df3_filtered["zip"]=df2["zip"]
df3_filtered.to_csv("C:/Users/cheukkin.Warwick/Desktop/dm_group/d3_filtered.csv")
from __future__ import division
df3_filtered["Type"]=df3_filtered.cluster
df3_filtered.Type=df3_filtered.Type.map({1:"Unknown", 2:"Classy",3:"stylish and Pubs & Bars",4:"Fastfood & Takeaways",5:"General"})
supply=pd.crosstab(df3_filtered["Type"],df3_filtered["zip"]).apply(lambda r: r*100/r.sum(), axis=1)
supply
“Table 2: Restaurants distributions across zips”
Table 2 shows the distribution of each type of restaurant across the zips as a percentage; each row sums to 100%. For example, 35.6% of classy restaurants are located in EH1.
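The row-normalisation used to build Table 2 can be illustrated on a toy frame (made-up values, not the report's data): each row of the result sums to 100%.

```python
# Toy example of the row-normalised crosstab behind Table 2.
import pandas as pd

df = pd.DataFrame({
    'Type': ['Classy', 'Classy', 'Classy', 'General', 'General'],
    'zip':  ['EH1',    'EH1',    'EH3',    'EH1',     'EH3'],
})

# Counts per (Type, zip), then each row expressed as a percentage of its total.
supply = pd.crosstab(df['Type'], df['zip']).apply(lambda r: r * 100 / r.sum(), axis=1)
print(supply)
```

Here two of the three Classy restaurants sit in EH1, so that cell reads about 66.7%.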
Distribution of good reviews across zip codes by restaurant cluster
review=pd.read_csv("C:/Users/cheukkin.Warwick/Desktop/dm_group/final_group_project/Kane Wu/edinReview.csv",header=0)
review=review.merge(df3_filtered, on=['business_id'], how='outer')
review_stars = review[review.stars.isin([4, 5])]
top_views=pd.crosstab(review_stars["Type"],review_stars["zip"]).apply(lambda r: r*100/r.sum(), axis=1)
top_views
“Table 3: Reviews distributions across zips”
Table 3 similarly shows the distribution of good reviews for each restaurant cluster across the zips; each row sums to 100%. For instance, for fastfood and takeaways, 39.7% of 4 and 5 star reviews are from EH1.
Measure of outperformance (or competition) per zip code
(top_views-supply)/supply
“Table 4: Ratio of disagreement of percentage distributions of review and restaurant across zips”
A ratio is created to observe the performance of (or the competition faced by) a restaurant type per zip code (Table 4):
\[competition= \frac{(\text{good reviews%} - \text{restaurants%})}{\text{restaurants%}}\]
It can also be thought of as \(\frac{\text{good reviews%}}{\text{restaurants%}} - 1\), which asks: are there more good reviews in a zip code than one would expect given its share of restaurants? The pivot is 0, where positive means yes and negative means no.
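A worked toy example (illustrative numbers only): if a zip code holds 10% of a cluster's restaurants but receives 15% of its good reviews, the ratio is positive.

```python
# Outperformance ratio: (good reviews% - restaurants%) / restaurants%
restaurants_pct = 10.0   # share of the cluster's restaurants in this zip
good_reviews_pct = 15.0  # share of the cluster's 4-5 star reviews in this zip

competition = (good_reviews_pct - restaurants_pct) / restaurants_pct
print(competition)  # 0.5 -> more good reviews than the supply share predicts
```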
A potential strategy for choosing the restaurant location is to pick a zip code with a high outperformance (or competition) ratio. Since the ratio is a proxy for volumes and customers are drawn to these places, it could also save on advertising costs. However, such a strategy would also be high risk, as a popular location brings high set-up costs and intense competition.
Given that the location analysis has already chosen a zip code, EH3, all that remains is choosing the theme/type of restaurant. For EH3, classy has a competition ratio of -0.2, stylish bars & pubs -0.2, fastfood & takeaways 0.0, unknown 0.0, and general 0.2. Based on the gaps in the EH3 market, the restaurant should consider being either a classy restaurant or a stylish bar & pub with good fish & chips and chicken dishes. The latter makes more sense, as the population income in EH3 is not relatively high.
To calculate potential revenue, the number of customers a restaurant may receive was estimated as a multiple of the number of British restaurant reviews (Figure 19). It is assumed that reviews represent approximately 0.1% of total visits in a year. For EH3 there are 161 reviews of British restaurants and 17 British restaurants in total, giving an average of 9.47 reviews per British restaurant. This corresponds to an estimated 9,470 customers per year (9.47 × 1,000). Note that the check-in dataset could also have been used as an alternative indicator of volumes, but check-ins depend on the number of offers and so may represent a biased value for analysis.
# British restaurants per zip, their reviews, and reviews per restaurant
british = edinburgh[edinburgh.cusine == 'British']
britishGroup = british.groupby('zip').cusine.count()
edinReview = pd.merge(reviews, british, how='inner', on="business_id")
reviewGroup = edinReview.groupby(['zip']).size()
grouped = pd.concat([britishGroup, reviewGroup, reviewGroup / britishGroup], axis=1)
“Table 5: Average Number of Reviews per British Restaurant per Zip”
The average price for British cuisine in EH3 was calculated for pricing. As shown in Figure 19 below, the average price range is around 2.53.
“Figure 19: Average Price for British Cuisine”
Data on prices obtained from the Yelp site is used to calculate expected revenues. The mapping of Yelp price symbols to dollar ($) value is given in table 6.
“Table 6: Yelp Symbol to Value”
2.5 lies between the second and third price ranges, so an average spend of £30 is assumed; multiplied by 9,470 customers, this gives £284,100 per year. This is a fairly conservative estimate of revenues.
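The arithmetic behind this back-of-envelope estimate, under the stated assumption that reviews represent roughly 0.1% of annual visits (i.e. visits ≈ reviews × 1,000):

```python
# Revenue estimate from review counts (figures as stated in the report).
reviews_eh3 = 161                  # British-restaurant reviews in EH3
restaurants_eh3 = 17               # British restaurants in EH3

reviews_per_restaurant = round(reviews_eh3 / restaurants_eh3, 2)  # 9.47
customers_per_year = round(reviews_per_restaurant * 1000)          # 9470
avg_spend_gbp = 30                 # between the 2nd and 3rd Yelp price bands

revenue = customers_per_year * avg_spend_gbp
print(customers_per_year, revenue)  # 9470 284100
```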
To summarise, the demand and location analysis suggested EH3 as an appropriate place to set up the restaurant: a high-volume area with an undersupplied market for British cuisine. The competition and theme analysis gave additional information by providing a metric of competition (the market in EH3 had competition metrics close to zero, implying fair competition in general), and suggested that there was low competition for stylish bars & pubs. The topic analysis of reviews concluded that making good fish & chips and chicken dishes could be a potentially successful endeavour, because this was found to be an area people frequently complained about. A potential pitfall is making the stylish bar/pub too quiet, as customers had complained about that in reviews. The table service and waiters should be polite and speedy, and the room temperature should be monitored.
The revenue analysis helped to set a medium price range for the restaurant and bar, at £30, and estimated 9,470 customers a year for a revenue of £284,100.
[1] D. Cai, X. He, J. Han, and T. S. Huang. Graph regularized nonnegative matrix factorization for data representation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 33(8):1548–1560, 2011.
[2] J. Choo, C. Lee, C. K. Reddy, and H. Park. UTOPIAN: User-driven topic modeling based on interactive nonnegative matrix factorization. IEEE Transactions on Visualization and Computer Graphics (TVCG), 19(12):1992–2001, 2013
[3] A. Cichocki, R. Zdunek, A. H. Phan, and S. Amari. Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-Way Data Analysis and Blind Source Separation. Wiley, 2009.
[4] Edinburgh ward and locality demographics. Sourced on June 9th 2016 from http://www.edinburgh.gov.uk/info/20247/edinburgh_by_numbers/1393/locality_and_ward_data_profiles
[5] City of Edinburgh Mapping Portal. Sourced on June 9th 2016 from http://data.edinburghcouncilmaps.info/datasets/2cee9b18a21344b0879c3c51d71fd2c6_28
[6] Yelp price symbol mapping to real values. Sourced on June 14th 2016 from http://www.yelp.com/topic/san-diego-can-anyone-give-me-the-actual-dollar-range-for-the-dollar-sign-symbols-in-rrgards-to-pricing